Section 1.2: Classifying and Storing Data Sources
Population
- A target group we want to study.
- The collection of all data from that group.
- It is often difficult (if not impossible) to obtain all data from the population.
Sample
- A subset of the population.
- Should represent the population as a whole.
- It is typically easier (and sometimes only possible) to collect data from a sample.
Suppose you want to know what the predominant eye color in your country is. You survey a random sample of 2,500 people in your country, asking them about their eye color.
- Who is the population?
- Who is the sample?
- What data were collected?
- The population is everyone in your country.
- The sample is the 2,500 people who were surveyed.
- The data collected were participants’ eye color.
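The relationship between a population and a sample can be sketched in Python. This is a minimal illustration, not real survey data: the population size, the three eye colors, and their proportions are all invented assumptions, and the names `population` and `sample` are hypothetical.

```python
import random
from collections import Counter

random.seed(1)  # fixed seed so the sketch is repeatable

# ASSUMPTION: a made-up population of 100,000 people with invented proportions.
colors = ["brown", "blue", "green"]
population = random.choices(colors, weights=[0.6, 0.3, 0.1], k=100_000)

# Survey a random sample of 2,500 people (a subset of the population).
sample = random.sample(population, 2_500)

# The data collected are the eye colors of the people in the sample.
sample_counts = Counter(sample)
predominant_in_sample = sample_counts.most_common(1)[0][0]
print(len(sample), predominant_in_sample)
```

Only the 2,500 sampled values are examined, yet the most common color in the sample serves as an estimate of the most common color in the whole population.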
Classifying Data
- A data variable is a characteristic that is measured or recorded.
- There are two types of variables: categorical and numerical.
- A variable is categorical if it describes a quality or a class.
- A categorical variable may use numbers as labels, but arithmetic operations on those numbers are not meaningful.
- Examples: eye color, zip code, letter grade, Social Security number (SSN)
- A variable is numerical if it describes a quantity or a measurement.
- Examples: height, weight, temperature
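The distinction above can be illustrated with a short Python sketch. The values below are hypothetical and chosen only for illustration. Storing zip codes as strings is one possible design choice: it prevents accidental arithmetic and preserves leading zeros.

```python
from statistics import mean, mode

# ASSUMPTION: invented values, for illustration only.
heights_cm = [160.0, 172.5, 181.0]       # numerical: a measurement
zip_codes = ["90210", "10001", "10001"]  # categorical: numbers used as labels

# Arithmetic on a numerical variable gives an interpretable quantity.
average_height = mean(heights_cm)

# For a categorical variable, the meaningful summary is a count of labels,
# such as the most frequent label (the mode).
most_common_zip = mode(zip_codes)

print(average_height, most_common_zip)
```

An "average zip code" could be computed if the labels were stored as integers, but the result would not describe anything real, which is why zip code is treated as categorical.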
Classify each of the following variables as numerical or categorical.
| Variable | Numerical | Categorical |
|---|---|---|
| Height of a building | | |
| Letter grade on a test | | |
| Hours of sleep each night | | |
| Students’ GPAs | | |
| Types of cars | | |
| Vegetable varieties planted in a garden | | |
| Number of vegetable varieties planted in a garden | | |
| Variable | Numerical | Categorical |
|---|---|---|
| Height of a building | X | |
| Letter grade on a test | | X |
| Hours of sleep each night | X | |
| Students’ GPAs | X | |
| Types of cars | | X |
| Vegetable varieties planted in a garden | | X |
| Number of vegetable varieties planted in a garden | X | |
Storing Data
Coded data are data that use numbers to represent information, which can make the data easier to record and interpret.
When a variable is binary (i.e., it has only two possible values), we often code it using 0 and 1, where 0 means false and 1 means true.
Suppose a local animal shelter received a litter of five surrendered puppies. A volunteer named each puppy and identified its sex in the table below. The manager wants to count the number of female puppies, so she asked you to add a new column named “Female”. How would you code this new column?
| Name | Sex |
|---|---|
| Daisy | Female |
| Hazel | Female |
| Luna | Female |
| Milo | Male |
| Rocky | Male |
If a puppy is female, assign a value of 1. Otherwise, assign a value of 0.
| Name | Sex | Female |
|---|---|---|
| Daisy | Female | 1 |
| Hazel | Female | 1 |
| Luna | Female | 1 |
| Milo | Male | 0 |
| Rocky | Male | 0 |
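
The coding rule above can be sketched in Python. The list-of-dictionaries layout and the key names (`name`, `sex`, `female`) are assumptions made for this illustration; the puppy data come from the table.

```python
# Each dictionary is one row of the puppy table.
puppies = [
    {"name": "Daisy", "sex": "Female"},
    {"name": "Hazel", "sex": "Female"},
    {"name": "Luna", "sex": "Female"},
    {"name": "Milo", "sex": "Male"},
    {"name": "Rocky", "sex": "Male"},
]

# Code the new binary column: 1 if the puppy is female, otherwise 0.
for puppy in puppies:
    puppy["female"] = 1 if puppy["sex"] == "Female" else 0

# Because females are coded as 1, summing the column counts them.
female_count = sum(puppy["female"] for puppy in puppies)
print(female_count)
```

Coding with 0 and 1 is convenient because the sum of the column equals the number of individuals for which the condition is true.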
Stacked data are data values with the following characteristics:
- Each column represents a variable.
- Each row contains data for a single observation/individual.
- Stacked data can store multiple variables across multiple observations.
The table below shows data on dogs in a local animal shelter. Each row corresponds to a single dog.
- Identify the variables.
- How many dogs are in the table?
| Weight (lbs) | Gender | Illness |
|---|---|---|
| 10 | M | N |
| 27 | F | Y |
| 6 | F | N |
| 45 | M | N |
| 65 | M | N |
| 33 | F | Y |
- The variables are weight, gender, and illness.
- There are 6 rows in the table; therefore, there are 6 dogs.
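One way to picture stacked data in Python is as a list of rows that share the same column names. This is only a sketch; the column names `weight_lbs`, `gender`, and `illness` are labels assumed for the illustration, and the values come from the table above.

```python
# Column names: one per variable.
columns = ("weight_lbs", "gender", "illness")

# One tuple per dog (one row per observation).
rows = [
    (10, "M", "N"),
    (27, "F", "Y"),
    (6, "F", "N"),
    (45, "M", "N"),
    (65, "M", "N"),
    (33, "F", "Y"),
]

# Pair each value with its column name to build the stacked table.
dogs = [dict(zip(columns, row)) for row in rows]

variables = list(dogs[0].keys())  # the columns are the variables
num_dogs = len(dogs)              # the rows are the individuals
print(variables, num_dogs)
```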
Unstacked data are data with the following characteristics:
- Data values are stored in two columns.
- Each column represents a variable from a different group.
- Unstacked data can only store data for two groups.
- A row does not correspond to a single individual or observation.
The unstacked table below shows the average number of hours slept over a one-week period for a sample of men and women.
| Men | Women |
|---|---|
| 6.4 | 6.2 |
| 7.2 | 7.5 |
| 8.1 | 7.9 |
| 6.7 | 8.0 |
| 7.0 | |
| 6.9 | |
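
As a rough sketch, the unstacked sleep table can be converted to stacked form in Python so that each row describes one person. The dictionary layout and the key names `group` and `hours_slept` are assumptions for this illustration; the values come from the table above.

```python
# Unstacked: one column per group, and the columns may differ in length.
unstacked = {
    "Men": [6.4, 7.2, 8.1, 6.7, 7.0, 6.9],
    "Women": [6.2, 7.5, 7.9, 8.0],
}

# Stacked: one row per person, with the group recorded as its own variable.
stacked = [
    {"group": group, "hours_slept": hours}
    for group, values in unstacked.items()
    for hours in values
]

print(len(stacked))
for row in stacked:
    print(row["group"], row["hours_slept"])
```

In the stacked version, the group name becomes a categorical variable and the hours slept become a numerical variable, so every row again corresponds to a single individual.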